feat: Coalesce small batches before shuffle write by andygrove · Pull Request #3234 · apache/datafusion-comet

andygrove · 2026-01-21T18:28:40Z

Summary

This PR adds batch coalescing before shuffle writes to reduce per-batch overhead and improve vectorization efficiency. When enabled, small columnar batches are combined until they reach the target batch size before being processed by the shuffle writer.
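To illustrate the idea of combining small batches up to a target size, here is a minimal sketch in plain Rust. It is not the code from this PR and does not use DataFusion's `CoalesceBatchesExec`. The `Batch` and `Coalescer` types and the `coalesce_all` helper are hypothetical names introduced only for this example, and a batch is simplified to a single column of `i64` values. The sketch emits batches of exactly `target_rows` rows; the real operator may emit batches slightly differently.

```rust
/// Toy stand-in for a columnar batch: one column of i64 values.
/// (Hypothetical type; Comet operates on Arrow record batches.)
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub values: Vec<i64>,
}

/// Buffers incoming rows and emits batches of exactly `target_rows` rows,
/// except possibly the final batch returned by `finish`.
pub struct Coalescer {
    target_rows: usize,
    buffer: Vec<i64>,
}

impl Coalescer {
    pub fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be positive");
        Coalescer { target_rows, buffer: Vec::new() }
    }

    /// Add one input batch and return any full batches that are now ready.
    pub fn push(&mut self, batch: Batch) -> Vec<Batch> {
        self.buffer.extend(batch.values);
        let mut ready = Vec::new();
        while self.buffer.len() >= self.target_rows {
            // `split_off` leaves the first `target_rows` rows in the buffer
            // and returns the remainder, which becomes the new buffer.
            let rest = self.buffer.split_off(self.target_rows);
            let full = std::mem::replace(&mut self.buffer, rest);
            ready.push(Batch { values: full });
        }
        ready
    }

    /// Flush whatever is left once the input is exhausted.
    pub fn finish(&mut self) -> Option<Batch> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(Batch { values: std::mem::take(&mut self.buffer) })
        }
    }
}

/// Convenience wrapper: coalesce a whole list of batches in one call.
pub fn coalesce_all(inputs: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    let mut coalescer = Coalescer::new(target_rows);
    let mut out = Vec::new();
    for batch in inputs {
        out.extend(coalescer.push(batch));
    }
    out.extend(coalescer.finish());
    out
}
```

With a target of 4 rows, input batches of 3, 2, 4, and 1 rows come out as batches of 4, 4, and 2 rows, so the downstream writer in this toy setup handles three batches instead of four. The saving grows as the input batches get smaller relative to the target.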

Key changes:

- Added spark.comet.shuffle.resizeBatches.input config to enable coalescing batches before shuffle write
- Added spark.comet.shuffle.resizeBatches.output config for coalescing after shuffle read
- Native planner wraps shuffle input with DataFusion's CoalesceBatchesExec when input coalescing is enabled
- Added CometBatchCoalescer Scala class for output-side batch coalescing

Test plan

- Verify existing unit tests pass
- Run TPC-H Q18 benchmark with spark.comet.shuffle.resizeBatches.input=true
- Verify GC metrics improve with the optimization enabled
- Test with various batch sizes to ensure correct behavior

🤖 Generated with Claude Code

…ency

This change adds batch coalescing before shuffle writes to reduce per-batch overhead and improve vectorization efficiency. When enabled, small columnar batches are combined until they reach the target batch size before being processed by the shuffle writer.

Benefits observed in TPC-H Q18 benchmarks:

- 10.9% overall query time improvement
- Significantly reduced GC pressure (Stage 26: 3,602ms -> 56ms GC time)
- Better vectorization efficiency for downstream operators

New configuration options:

- spark.comet.shuffle.resizeBatches.input: Coalesce batches before shuffle write (default: false)
- spark.comet.shuffle.resizeBatches.output: Coalesce batches after shuffle read (default: true)

The native planner now wraps shuffle input with DataFusion's CoalesceBatchesExec when spark.comet.shuffle.resizeBatches.input is enabled.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

andygrove force-pushed the coalesce-batches-shuffle branch from fdc9074 to 5cccc1b Compare January 21, 2026 18:29

andygrove changed the title ~~feat: Coalesce small batches before shuffle write for improved efficiency~~ feat: Coalesce small batches before shuffle write to reduce GC pressure Jan 21, 2026

andygrove changed the title ~~feat: Coalesce small batches before shuffle write to reduce GC pressure~~ feat: Coalesce small batches before shuffle write Jan 21, 2026

andygrove closed this Jan 21, 2026

andygrove wants to merge 1 commit into apache:main from andygrove:coalesce-batches-shuffle
